Differential expression analysis with unmeasured confounders
Idea:
Differential expression analysis aims to identify genes with differing average expression levels between two or more groups and to quantify these differences as a treatment effect, denoted by \(\beta\). Gene expression data is often subject to unknown technical variation, and failing to account for it can lead to estimators of \(\beta\) with increased variance and/or bias. In this work, we propose and evaluate an estimator of the causal Average Treatment Effect (\(\beta_{ATE}\)) that does not require knowledge (either known or estimated) of a technical factor (\(W\)) present in the model.
Findings
In the simulations, the \(\hat\alpha_{ATE}\) estimator, that describes the effect of the unwanted variation, outperforms the \(\hat\alpha_{RUV4}\) estimator.
On average, the \(\hat\beta_{ATE}\) estimator outperforms the OLS estimator in a model with no information about technical factors.
The \(\hat\beta_{RUV4}\) estimator has a more consistent performance than the \(\hat\beta_{ATE}\) estimator.
The magnitude of \(\delta\) has an effect over the accuracy of the estimations.
The number of hidden unwanted factors affects the treatment effect estimations in RUV4.
After identifying the joint distribution between the unwanted factor \(W\), the treatment, or factor of interest \(X\) and the gene expression matrix \(\textbf{Y}\) we observed in the simulations that a confounder has the same effect over the estimation of \(\beta\) than a mediator.
The ATE estimator:
Let \((Y_{i1}, \ldots, Y_{ip}), \; i = 1, \ldots, n\), denote the sample observations across \(p\) genes and \(n\) subjects, let \(X_i = 0, 1\) denote the exposure indicator, with \(X\sim Bin(1,\pi)\) and let \(W_i\) denote the unobserved/unmeasured technical factor, i.e. a latent factor.
For a particular gene \(j\), consider the model
\[ Y_{ij} = \gamma_j + W_i \alpha_j + X_i \beta_j + \varepsilon_{ij}, \]
or in vector notation
\[ Y_i = \gamma + W_i \alpha + X_i \beta + E_i. \]
It can be proven that the standardisation estimator of the Average Treatment Effect (ATE) is equal to the estimate of \(\beta\):
\[ \frac{1}{n} \sum_{i=1}^n \left[ \widehat{Y}(W = W_i, X = 1) - \widehat{Y}(W = W_i, X = 0) \right] = \hat{\beta}. \]
We first introduce the following notation:
\(\mathbb{E}\{E_i \mid X_i = 0\} = \mu_0\) and \(\mathbb{E}\{E_i \mid X_i = 1\} = \mu_1\).
\(Var\{W \mid X_i = 0\} = \nu_0\) and \(Var\{W \mid X_i = 1\} = \nu_1\).
and assume that:
\(\alpha^t\alpha = 1\).
\(\alpha^t\beta = 0\).
\(\mathbb{E}\{E_i\} = 0\) and \(Var\{E_i\} = \Sigma\).
\(Var\{Y \mid X = 0\} = \Sigma_0\) and \(Var\{Y \mid X = 1\} = \Sigma_1\).
To derive the following expressions:
\[(\Sigma_1 − \Sigma_0)\alpha = (\nu^2_1-\nu^2_0)\alpha.\] \[ \beta = \mu_1 − \mu_0 − \alpha\alpha^t(\mu_1 − \mu_0).\] From the first expression we notice that \(\alpha\) is the only eigenvector of the matrix \(\Sigma_1 − \Sigma_0\), and it can be estimated using the covariance matrices of \(Y|X=1\) and \(Y|X=0\).
From the second expression, we can find \(\hat \beta\) by replacing each parameter with its estimate.
Confounders and mediators
In the previous model, \(W\) and \(X\) have an effect over the outcome variables \(Y_1, ... , Y_p\) . In some instances the latent factor \(W\) is correlated with the exposure \(X\). The correlation might happen because the exposure has also an effect over the latent factor, or because the latent factor has also an effect over the exposure. Both types of relationships are described below:
flowchart LR A[Exposure] --> B[Outcome] A --> C[W] C --> B
flowchart LR E[Exposure] --> F[Outcome] D[W] --> E D --> F
In the left diagram, \(W\) acts as a mediator, while in the right diagram, \(W\) acts as a confounder.
If \(W\) is a mediator, we assume a linear relationship with \(X\):
\[\begin{align} W = U + X\delta, \end{align}\]where \(U\) is an unobserved random noise variable assumed to follow a normal distribution with mean \(\mu_u\) and standard deviation \(\sigma_u\), and is independent of \(X\). The parameter \(\delta\) characterizes the extent of the mediation.
Therefore, the marginal distribution of \(W|X\) is given by:
\[\begin{align} W|X=0 &\sim \mathcal{N}(\mu_u, \sigma_u) \\ W|X=1 &\sim \mathcal{N}(\delta + \mu_u, \sigma_u) \end{align}\]Using the law of total probability, the probability density function (pdf) of \(W\), denoted \(f_w\), is given by:
\[\begin{align} f_w = f_{w|x=1}\pi + f_{w|x=0}(1 - \pi) \end{align}\]Using Bayes’ theorem, the conditional pdf of \(X\) given \(W\), denoted \(f_{x|w}\), is given by:
\[\begin{align} f_{x|w} = \dfrac{f_{w|x}f_x}{f_w} \end{align}\]Without loss of generality, assume that \(\pi = 0.5\), hence \(\pi = 1 - \pi\), and
\[\begin{align} f_{x|w} &= \dfrac{f_{w|x}\pi}{\pi[f_{w|x=1} + f_{w|x=0}]} = \dfrac{f_{w|x}}{f_{w|x=1} + f_{w|x=0}} \end{align}\]Replacing the conditional densities, we obtain:
\[\begin{align} f_{x|w} &= \dfrac{\dfrac{1}{\sqrt{2\pi\sigma_u^2}} \exp\left\{ -\dfrac{[w - (x\delta + \mu_u)]^2}{2\sigma_u^2} \right\}}{\dfrac{1}{\sqrt{2\pi\sigma_u^2}} \left[ \exp\left\{ -\dfrac{[w - (\delta + \mu_u)]^2}{2\sigma_u^2} \right\} + \exp\left\{ -\dfrac{(w - \mu_u)^2}{2\sigma_u^2} \right\} \right]} \end{align}\]In particular, if \(U \sim \mathcal{N}(0, 1)\) (i.e., a standard normal distribution), we obtain:
\[\begin{align} f_{x|w} &= \dfrac{\exp\left\{ \dfrac{w^2}{2} \right\} \exp\left\{ \dfrac{\delta^2 - 2w\delta x}{2} \right\}}{\exp\left\{ \dfrac{w^2}{2} \right\} \left[ \exp\left\{ \dfrac{\delta^2 - 2w\delta}{2} \right\} + 1 \right]} \end{align}\]If \(X = 1\), then:
\[\begin{align} f_{x|w} &= \dfrac{\exp\left\{ \dfrac{\delta^2 - 2w\delta}{2} \right\}}{\exp\left\{ \dfrac{\delta^2 - 2w\delta}{2} \right\} + 1} = \dfrac{1}{\exp\left\{ \dfrac{2w\delta - \delta^2}{2} \right\} + 1} \end{align}\]Analogously, if \(X = 0\), then:
\[\begin{align} f_{x|w} &= \dfrac{\exp\left\{ \dfrac{\delta^2}{2} \right\}}{\exp\left\{ \dfrac{\delta^2 - 2w\delta}{2} \right\} + 1} \end{align}\]Alternatives
We compare the results with:
- A naive linear regression model without the technical factor: \(Y_i = \gamma + X_i \beta + E_i.\)
- The RUV4 model with no negative control genes.
- A gold standard model with the “unobserved’ technical factor: \(Y_i = \gamma + W_i \alpha + X_i \beta + E_i.\)
Simulated data
We simulate the matrix \(\textbf{Y}\) for a sample with \(n=100\) subjects and \(p=50\) genes following the model
\[ Y_i = \gamma + W_i \alpha + X_i \beta + E_i. \]
The parameters \(\alpha\) and \(\beta\) are randomly generated from two different zero mean normal distributions. \(\gamma\) is fixed at 1 for all genes and subjects. Under no confounding or mediation, \(X\) follows a binomial distribution with \(\pi=0.5\) and \(W\) follows a standard normal distribution.
When \(W\) is a mediator, we follow the model proposed by Wang et al. , and characterized the relationship as:
\[\begin{align} W = U + X\delta, \end{align}\]where \(U\) follows a standard normal distribution.
When \(W\) is a confounder, we generate \(X\) from a Bernoulli distribution with \(\pi\) given by \[ \pi_{X|W}= \frac{1}{1+\exp\left\{\frac{2w\delta -\delta^2}{2}\right\}}\]
= , ensuring that \(f_{w|x}\) and \(f_{x|w}\) agree with the underlying joint distribution \(f_{xwy}\).
To create a more realistic setting, scenarios in which a percentage of the \(\beta\) parameters are set to zero are also considered.